Load dataset

Two different versions of the count matrix exist: the one from GEO and the one created by us personally from cell ranger

##            used  (Mb) gc trigger   (Mb) limit (Mb)  max used  (Mb)
## Ncells  3235143 172.8    5279980  282.0         NA   5279980 282.0
## Vcells 64592090 492.8  142652204 1088.4      16384 129874601 990.9

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis

##            used  (Mb) gc trigger (Mb) limit (Mb)  max used   (Mb)
## Ncells  3458735 184.8    5279980  282         NA   5279980  282.0
## Vcells 65798104 502.0  801505012 6115      16384 990044566 7553.5

Decided to go with GEO matrix for analysis (probably no big difference in the end)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
## An object of class Seurat 
## 15619 features across 3399 samples within 1 assay 
## Active assay: RNA (15619 features, 0 variable features)

Quality control

## An object of class Seurat 
## 15619 features across 2260 samples within 1 assay 
## Active assay: RNA (15619 features, 0 variable features)

Identify highly variable features

## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Scaling the data

## Centering and scaling data matrix

PCA

## PC_ 1 
## Positive:  TXN, LDHB, GAPDH, YBX1, TPI1, MIF, SLC25A5, RPL35, NME1, ENO1 
##     NDUFS6, UQCRQ, RPL36, PRDX2, SNRPE, PRELID1, SSBP1, RAN, LSM4, APRT 
##     PHB, PSMA7, RPL7A, NPM1, PGAM1, LDHA, SHFM1, NHP2, SLIRP, RANBP1 
## Negative:  RPS4Y1, HLA-DPA1, HLA-DPB1, HLA-DQA2, HLA-DRA, HLA-DRB1, HLA-DRB5, HLA-DQA1, CD69, EVL 
##     HLA-DQB1, PMAIP1, CD79A, ITGB2, REL, BANK1, CD83, PTPRC, BIN1, IRF8 
##     MYO1F, CD99, FNBP1, PHACTR1, MYO1G, HCST, NKG7, INPP5D, CELF2, CLEC2B 
## PC_ 2 
## Positive:  IGLC3, MAFB, IGLV2-14, SSR4, IGHG4, DEPTOR, IGHG3, LGALS1, S100A4, IGHG1 
##     GULP1, HGF, PEBP1, SQLE, CBLN2, MT-CYB, TYMP, PIM2, EMP3, HSPA5 
##     PBXIP1, RPLP0, CD44, CAV1, IGHA2, AC074289.1, METTL7A, TSC22D3, NTRK2, EVI2A 
## Negative:  RPS4Y1, ITGB2, CD99, STMN1, HCST, EVL, KIAA0101, NKG7, TMSB4X, PTPRC 
##     GNLY, MYBL2, TYMS, CST7, ZWINT, CCL5, HLA-DPA1, GZMA, RRM2, MYO1F 
##     ASF1B, MKI67, TK1, HMGB2, CTSW, HMGN2, DHFR, HOPX, HLA-DQA2, CD7 
## PC_ 3 
## Positive:  NKG7, GNLY, HCST, ITGB2, GZMA, CCL5, CST7, CTSW, CD7, PRF1 
##     KLRB1, HOPX, CD247, GZMB, EVL, LCK, IFITM2, ID2, KLRF1, MYO1F 
##     TRDC, TMSB4X, MATK, TYROBP, KLRD1, GZMM, CD99, GZMH, SPON2, FCER1G 
## Negative:  STMN1, RRM2, MKI67, KIAA0101, CDCA5, MYBL2, ZWINT, DHFR, ASF1B, TK1 
##     AURKB, SPC25, CDK1, SKA1, CLSPN, KIFC1, TYMS, TOP2A, PBK, HMGB2 
##     WDR34, BIRC5, CENPN, PHF19, CCNA2, HIST1H4C, SKA3, SHCBP1, CENPU, H2AFZ 
## PC_ 4 
## Positive:  NKG7, GNLY, CCL5, GZMA, CTSW, CST7, PRF1, CD7, KLRB1, CD247 
##     GZMB, TRDC, GZMH, KLRF1, KLRD1, MATK, SPON2, LCK, GZMM, TRBC1 
##     IL32, ZAP70, CLIC3, FGFBP2, HCST, KLRC1, AOAH, ADGRG1, ID2, PRKCH 
## Negative:  CD74, HLA-DRA, HLA-DPA1, HLA-DRB1, HLA-DPB1, HLA-DQA2, CD79A, HLA-DRB5, HLA-DQA1, HLA-DMA 
##     HLA-DQB1, CD37, CD52, HLA-DMB, LAPTM5, BANK1, CD22, BLK, CD83, IRF8 
##     RPS4Y1, HHEX, IGHD, CD19, TNFRSF13C, PHACTR1, LRMP, ARID5B, LTB, SWAP70 
## PC_ 5 
## Positive:  LGALS1, S100A6, S100A4, S100A11, ZBP1, MT-CO1, PBXIP1, KLF6, EMP3, ISG20 
##     ADGRE5, DEPTOR, NEAT1, OAS1, ISG15, POU2F2, IFI16, JUN, DUSP1, CRIP1 
##     RP11-81H14.2, S100A10, SAMD9L, EVI2A, LMNA, IL16, HIST1H1C, AHNAK, MX1, IRF7 
## Negative:  HMGA1, HSPE1, NPM1, ECE2, FKBP4, CCND2, EBNA1BP2, TOMM40, EXOSC4, CACYBP 
##     FABP5, ATAD3B, LDHA, LSM7, HSPD1, PA2G4, NME1, PHB, WDR4, CISD3 
##     MCM7, SCARB1, NOP16, CCT2, PAICS, SNRPA1, POLR2H, NOP58, TIMM13, MRPL4

## Find clusters

## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2260
## Number of edges: 71089
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8306
## Number of communities: 7
## Elapsed time: 0 seconds

Plot UMAP

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 14:30:23 UMAP embedding parameters a = 0.9922 b = 1.112
## 14:30:23 Read 2260 rows and found 10 numeric columns
## 14:30:23 Using Annoy for neighbor search, n_neighbors = 30
## 14:30:23 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 14:30:23 Writing NN index file to temp file /var/folders/p4/zn4sp81s5gq35dzmr0nknsjc0000gn/T//Rtmps8fWRq/file21fe7524447
## 14:30:23 Searching Annoy index using 1 thread, search_k = 3000
## 14:30:23 Annoy recall = 100%
## 14:30:23 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 14:30:24 Initializing from normalized Laplacian + noise (using irlba)
## 14:30:24 Commencing optimization for 500 epochs, with 92016 positive edges
## 14:30:27 Optimization finished

Check known cancer marker genes

Specific MM marker genes: B2M, CCND1, Immunoglobulins (IGHM, IGKC)

Source: https://www.cancer.gov/about-cancer/diagnosis-staging/diagnosis/tumor-markers-list

Additional marker used in the Numbat paper for MM: MZB1

Annotate based on the cancer marker genes

##                  p_val avg_log2FC pct.1 pct.2     p_val_adj
## RPS4Y1    0.000000e+00   10.56039 0.058 0.953  0.000000e+00
## HLA-DRB1 8.572363e-274  -85.19576 0.042 0.809 1.338917e-269
## HLA-DPA1 1.149036e-271  -82.28723 0.054 0.844 1.794679e-267
## HLA-DQA2 3.944242e-266  -10.89713 0.006 0.619 6.160511e-262
## HLA-DPB1 3.530689e-252 -137.06263 0.069 0.848 5.514583e-248
## HLA-DQA1 8.267348e-246  -22.22866 0.013 0.619 1.291277e-241
## [1] 3516

Compare annotation with results for SCEVAN and copyKat

##          
##           aneuploid diploid not.defined
##   control        13     208          36
##   MM           1992       4           7
##          
##           filtered normal tumor
##   control       21    223    13
##   MM             6      4  1993
##              
##               filtered normal tumor
##   aneuploid          0      0  2005
##   diploid            0    212     0
##   not.defined       27     15     1